In this task we will use the nltk package to recognize and classify named entities in a given text (in this case, the Wikipedia article about the American Revolution). The nltk.ne_chunk function can be used for both recognition and classification of named entities. We will also implement a custom NER function to recognize entities, and a custom function to classify named entities using their Wikipedia articles.
In [2]:
import nltk
import numpy as np
import wikipedia
import re
Suppress wikipedia package warnings.
In [3]:
import warnings
warnings.filterwarnings('ignore')
Helper functions to process the output of nltk.ne_chunk and to count the frequency of named entities in a given text.
In [4]:
def count_entities(entity, text):
    s = entity
    if type(entity) is tuple:
        s = entity[0]
    # escape the entity string so regex metacharacters are matched literally
    return len(re.findall(re.escape(s), text))

def get_top_n(entities, text, n):
    a = [(e, count_entities(e, text)) for e in entities]
    a.sort(key=lambda x: x[1], reverse=True)
    return a[0:n]
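As a quick sanity check, get_top_n on a toy string (a hypothetical input, not from the article) should behave like this:

toy = 'Boston and Boston and Paris.'
print(get_top_n(['Boston', 'Paris'], toy, 2))
# expected: [('Boston', 2), ('Paris', 1)]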
# For an element of the list produced by nltk.ne_chunk:
# returns (entity, label) for a single word tagged as a named entity, or
# concatenates a multi-word named entity subtree into a single string
def get_entity(entity):
    if isinstance(entity, tuple) and entity[1][:2] == 'NE':
        return entity
    if isinstance(entity, nltk.tree.Tree):
        text = ' '.join([word for word, tag in entity.leaves()])
        return (text, entity.label())
    return None
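To illustrate, here is a hand-built chunk shaped the way nltk.ne_chunk emits them (a hypothetical example, not actual output from the article):

t = nltk.tree.Tree('GPE', [('United', 'NNP'), ('States', 'NNP')])
print(get_entity(t))               # ('United States', 'GPE')
print(get_entity(('led', 'VBD')))  # None: plain (word, tag) tuples are skipped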
Since nltk.ne_chunk tends to put the same named entity into multiple classes (like 'American' : 'ORGANIZATION' and 'American' : 'GPE'), we want to filter out these duplicates.
In [5]:
# returns a list of named entities in the form [(entity_text, entity_label), ...]
def extract_entities(chunk):
    data = []
    for entity in chunk:
        d = get_entity(entity)
        if d is not None and d[0] not in [e[0] for e in data]:
            data.append(d)
    return data
Our custom NER function, adapted from the example here.
In [13]:
def custom_NER(tagged):
    entities = []
    entity = []
    for word in tagged:
        # collect consecutive nouns; allow a preposition inside a running
        # entity (e.g. 'Battle of Saratoga')
        if word[1][:2] == 'NN' or (entity and word[1][:2] == 'IN'):
            entity.append(word)
        else:
            # drop a preposition left dangling at the end of the entity
            if entity and entity[-1][1].startswith('IN'):
                entity.pop()
            if entity:
                s = ' '.join(e[0] for e in entity)
                # keep only capitalized, non-trivial, previously unseen strings
                if s not in entities and s[0].isupper() and len(s) > 1:
                    entities.append(s)
            entity = []
    return entities
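A quick check on a single hand-written sentence (the exact output depends on the POS tagger, but it should look something like this):

sample = nltk.pos_tag(nltk.word_tokenize('George Washington led the Continental Army.'))
print(custom_NER(sample))
# expected: something like ['George Washington', 'Continental Army']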
Loading the processed article (approximately 500 sentences). A regex substitution removes reference links (e.g. [12]).
In [14]:
text = None
with open('text', 'r') as f:
    text = f.read()
text = re.sub(r'\[[0-9]*\]', '', text)
Now we try to recognize entities with both nltk.ne_chunk and our custom_NER function and print the 20 most frequent entities. The results are fairly similar; the nltk.ne_chunk function also adds basic classification tags.
In [15]:
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
ne_chunked = nltk.ne_chunk(tagged, binary=False)
ex = extract_entities(ne_chunked)
ex_custom = custom_NER(tagged)

top_ex = get_top_n(ex, text, 20)
top_ex_custom = get_top_n(ex_custom, text, 20)

print('ne_chunked:')
for e in top_ex:
    print('{} count: {}'.format(e[0], e[1]))
print()
print('custom NER:')
for e in top_ex_custom:
    print('{} count: {}'.format(e[0], e[1]))
Next we want to do our own classification, using a Wikipedia article for each named entity. The idea is to find the article matching the entity string (for example 'America') and then build a noun phrase from its first sentence. When no suitable article or description is found, the entity is classified as 'Thing'.
In [82]:
def get_noun_phrase(entity, sentence):
    t = nltk.pos_tag(nltk.word_tokenize(sentence))
    phrase = []
    stage = 0
    for word in t:
        # stage 0: skip ahead to the copula; stage 1: collect the phrase
        if word[0] in ('is', 'was', 'were', 'are', 'refers') and stage == 0:
            stage = 1
            continue
        elif stage == 1:
            if word[1] in ('NN', 'JJ', 'VBD', 'CD', 'NNP', 'NNPS', 'RBS', 'IN', 'NNS'):
                phrase.append(word)
            elif word[1] in ('DT', ',', 'CC', 'TO', 'POS'):
                continue
            else:
                break
    # drop a preposition left dangling at the end of the phrase
    if len(phrase) > 1 and phrase[-1][1] == 'IN':
        phrase.pop()
    phrase = ' '.join([word[0] for word in phrase])
    if phrase == '':
        phrase = 'Thing'
    return {entity: phrase}

def get_wiki_desc(entity, wiki='en'):
    wikipedia.set_lang(wiki)
    try:
        fs = wikipedia.summary(entity, sentences=1)
    except wikipedia.DisambiguationError as e:
        # pick the first suggested page from the disambiguation options
        fs = wikipedia.summary(e.options[0], sentences=1)
    except wikipedia.PageError:
        return {entity: 'Thing'}
    return get_noun_phrase(entity, fs)
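For instance, on a hand-written sentence (a hypothetical example; the exact phrase depends on the POS tagger):

print(get_noun_phrase('Paris', 'Paris is the capital of France.'))
# expected: something like {'Paris': 'capital of France'}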
Obviously this classification is much more specific than the tags used by nltk.ne_chunk. We can also see that both NER methods mistook common words for entities unrelated to the article (for example 'New'). Since the custom_NER function relies on uppercase letters to recognize entities, this is commonly caused by the first words of sentences.
The lack of a description for the entity 'America' is caused by the simple way the get_noun_phrase function constructs descriptions. It looks for basic verbs like 'is', so more elaborate language can throw it off. This could be fixed by searching the Simple English Wikipedia, or using it as a fallback when no suitable phrase is found on the regular English Wikipedia (for example, compare the article about the Americas on the simple and regular wikis); a minimal sketch of such a fallback follows. I also tried searching for a more general verb (present tense verb, tag 'VBZ'), but this yielded worse results. Another improvement could be simply expanding the verb list in get_noun_phrase with other suitable verbs.
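A minimal sketch of the fallback idea, reusing get_wiki_desc as defined above (an illustration of the suggestion, not code from the notebook; get_wiki_desc_with_fallback is a hypothetical helper name):

def get_wiki_desc_with_fallback(entity):
    # try the regular English wiki first
    desc = get_wiki_desc(entity, wiki='en')
    # retry on Simple English Wikipedia when no phrase could be extracted
    if desc[entity] == 'Thing':
        desc = get_wiki_desc(entity, wiki='simple')
    return desc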
When no exact match for the pair (entity, article) is found, the wikipedia module raises a DisambiguationError, which (like a disambiguation page on Wikipedia) offers possible matching pages. When this happens, the first suggested page is picked. This, however, is not necessarily the best page for the given entity.
In [77]:
for entity in top_ex:
    print(get_wiki_desc(entity[0][0]))
In [73]:
for entity in top_ex_custom:
    print(get_wiki_desc(entity[0]))
When searching the simple wiki, the entity 'Americas' gets a fairly reasonable description. However, there seems to be an issue with handling DisambiguationError: in some cases, looking up the first page from DisambiguationError.options raises another DisambiguationError (even though pages from .options should be guaranteed hits). A more defensive lookup is sketched after the example below.
In [83]:
get_wiki_desc('Americas', wiki='simple')
Out[83]:
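One way to work around both problems (blindly taking the first option, and options that themselves raise errors) is to try the suggested pages in order and skip any that fail. This is a hedged sketch rather than the notebook's code; get_wiki_desc_robust is a hypothetical helper name:

def get_wiki_desc_robust(entity, wiki='en', max_tries=5):
    wikipedia.set_lang(wiki)
    queue, seen = [entity], set()
    while queue and max_tries > 0:
        title = queue.pop(0)
        if title in seen:
            continue
        seen.add(title)
        max_tries -= 1
        try:
            fs = wikipedia.summary(title, sentences=1)
            return get_noun_phrase(entity, fs)
        except wikipedia.DisambiguationError as e:
            # queue the suggested pages instead of failing on the first one
            queue.extend(e.options)
        except wikipedia.PageError:
            continue
    return {entity: 'Thing'}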